Robust reconstruction of single-cell RNA-seq data with iterative gene weight updates

Abstract Motivation Single-cell RNA-sequencing technologies have greatly enhanced our understanding of heterogeneous cell populations and underlying regulatory processes. However, structural (spatial or temporal) relations between cells are lost during cell dissociation. These relations are crucial for identifying associated biological processes. Many existing tissue-reconstruction algorithms use prior information about subsets of genes that are informative with respect to the structure or process to be reconstructed. When such information is not available, and in the general case when the input genes code for multiple processes, including being susceptible to noise, biological reconstruction is often computationally challenging. Results We propose an algorithm that iteratively identifies manifold-informative genes using existing reconstruction algorithms for single-cell RNA-seq data as subroutine. We show that our algorithm improves the quality of tissue reconstruction for diverse synthetic and real scRNA-seq data, including data from the mammalian intestinal epithelium and liver lobules. Availability and implementation The code and data for benchmarking are available at github.com/syq2012/iterative_weight_update_for_reconstruction.


Introduction
Single-cell RNA-sequencing (scRNA-seq) technologies have revolutionized our understanding of heterogeneous cellular populations and the underlying biological processes they encode (Wu et al. 2014;Potter 2018). However, the original spatial context of the cells (and the resulting spatial gene expression patterns) is lost due to the experimental procedure. While multiple computational methods have been developed to reconstruct spatial cellular context (Mages et al. 2023;Nitzan et al. 2019;Satija et al. 2015;Teves and Won 2020;Zeng et al. 2022), this task is challenging, partially because the cells simultaneously encode multiple spatiotemporal processes, and due to technical and biological noise. A common pre-processing step is considering only the highly variable genes as input for such reconstruction algorithms. However, these genes may not all be correlated with the same underlying process and thus may interfere with the reconstruction procedure.
The task of reconstructing cellular structure (or lowdimensional manifold in gene expression space) is tightly related to the task of identifying informative genes with respect to that manifold. On the one hand, a panel of manifold-informative marker genes can yield a metric similar to cell distances on the original structure, thus allowing for the clustering of cells that are close by in the original tissue (Droin et al. 2020). On the other hand, given an embedding of cells, one can recover informative genes with respect to that embedding by viewing individual gene expression as a function of cell location using, for example, regression-type algorithms (Droin et al. 2020). Here, we propose an algorithm that obtains enhanced reconstruction of single-cell data by alternating between these two tasks. By learning a representation of the subset of genes informative for a certain spatiotemporal process (or expressed smoothly along a certain manifold), we show that iterative gene weight update improves the reconstruction quality of existing algorithms, given the weighted gene expression as input, without relying on prior information.
We propose an algorithm that iteratively identifies manifold-informative genes, which can be integrated into existing reconstruction algorithms for single-cell data as a subroutine. Given single-cell RNA-seq data and a baseline learning algorithm that reconstructs gene expression along a spatial or temporal manifold, our algorithm iteratively updates the weight distribution over genes in the gene expression matrix taken as input to the reconstruction algorithm, to gradually increase the weight over genes related to the underlying reconstructed manifold. This leads to the identification of manifold-informative genes and a better reconstruction of the spatial or temporal manifold encoded by the cells. We demonstrate that such gene weight update improves the quality of reconstruction and identification of manifoldinformative genes in several settings, including (i) diverse synthetic single-cell datasets when integrated with an Autoencoder as the baseline algorithm and (ii) over scRNAseq datasets of mammalian intestine epithelium (Moor et al. 2018) and liver lobules (Droin et al. 2020), when integrated with either an Autoencoder or with novoSpaRc (Nitzan et al. 2019;Moriel et al. 2021) (an optimal-transport-based tissue reconstruction method) as the baseline algorithms.

Model description and assumptions
Let fx 1 ; . . . ; x N g 2 ðR n Þ N be gene expression profiles, where n :¼ number of genes and N :¼ number of cells. We assume that each cell, x, is represented by its locations on various low-dimensional manifolds M 1 ; . . . ; M k , each representing a different biological structure or process. Here, we use location in a generalized way that includes time and additional cell labels belonging to other types of cell structures. However, the true cell locations are generally lost due to the scRNA-seq experimental procedure. For example, the spatial location of cells within the tissue of origin is lost due to cell dissociation. A gene is correlated with, or informative with respect to a manifold M i if its expression at cell x can be written as a smooth function (e.g. a low-degree polynomial) of its location on M i .  where c i 2 M i . Let S i be the set of genes correlated to M i s.t. P i jS i j ¼ n. Suppose there exists a set of functions f 1 ; . . . ; f k s.t. f i : M i ! R jS i j and a permutation matrix p 2 f0; 1g nÂn . The gene expression at cell x is where fy x g are i.i.d. samples from a distribution over R n .
REMARK 1. In general, gene expressions are not deterministic functions and we do not assume a subset of genes correlated with the same manifold corresponds to a consecutive subset of coordinates of gene expression vector. In the above definition, we use p to include all gene orderings and fy x g to introduce randomness.
Denote X ¼ ½x 1 ; . . . ; x N 2 R nÂN as the input matrix where each column is a cell. Given X, the goal is to recover S i and the corresponding true cell locations fc i g.
Note that some gene g may not depend on any cell manifold M i . Those genes are not considered informative and we assume gene expression over cells is distributed as independent standard normal random variable.

Notation
Denote P n ¼ fp 2 R n þ : jjpjj 1 ¼ 1g to be the set of distributions over ½n ¼ f1; . . . ; ng. Let u n be the uniform distribution over [n]. For S & ½n, let v S 2 P n be the uniform distribution over S and p S ¼ jSjhp; v S i be the weight over the set of genes S. Let P c n & P n be the set of distributions with jjpjj 1 c.

Description of algorithm
To enhance the reconstruction of single-cell data along a certain spatial structure or temporal process, we propose to "boost" the performance of a weak reconstruction algorithm by iteratively shifting the emphasis, or the weight distribution over genes, toward the group of manifold-informative genes using multiplicative weight update (Arora et al. 2012). A pseudo-code description is given in Algorithm 1. Suppose we have a baseline algorithm that can reconstruct cellular location and gene expression along an underlying manifold M i when all the genes used as input to that algorithm correlate with that manifold and suppose we know the subset of genes S i . In that case, we can first project the input data X onto the subset of genes S i that correlate to M i and use X i as input to the baseline learning algorithm to reconstruct M i . Generally, if we do not know S i and the input gene expression matrix contains genes that are not correlated with M i , we expect the algorithm's error to increase. Therefore, our goal is to generate a mask over the input gene expression matrix, which emphasizes, or places a higher probability density, over S i . This motivates the following definition.
DEFINITION 2 (Baseline algorithm). Given X and a distribution p over f1; . . . ; ng, a weak algorithm A k learns a k dimensional projection E : R N ! R k and D : R k ! R N s.t. for some e > 0, the reconstruction error of A k , errðA k ðÁÞÞ, is bounded by some monotone function f k s.t. f k ðv S S Þ ¼ 0 where S is a subset of genes recoverable by A k and S ¼ ½n=S.
In this work, we use round to refer to an iteration in step 3 of Algorithm 1, where we train the baseline algorithm on reweighted data based on the current weights. Maximum number of rounds can be chosen based on application, for example, either terminate the algorithm when the improvement in loss drops below a predetermined threshold (similarly to gradient descent), or when the relative entropy drops below a certain threshold (indicating gene weights concentration on a subset of genes) ( Fig. 1c and g).

Baseline algorithms
A baseline reconstruction algorithm, as defined in Definition 2, is a subroutine to our proposed weight update algorithm. The goal of our algorithm is to enhance the performance of the baseline algorithm by focusing on manifold-informative genes for its respective reconstruction. We test our weight update algorithm over two baseline reconstruction algorithms: feed-forward Autoencoder and novoSpaRc.
Autoencoder (AE). AE takes as input the reweighted data matrix diagðpÞX, outputs an encoder E : R D ! R k , corresponding to a function that maps input gene expressions to a low-dimensional representation, and a decoder D : R k ! R D , corresponding to a function that computes the reconstructed gene expression from the low-dimensional representation. Autoencoders and their variants are widely used for singlecell reconstruction (Eraslan et al. 2019;Amodio et al. 2019;Tran et al. 2021). In this paper, we train a shallow AE with Tanh activation in each round to demonstrate the effectiveness of our weight update algorithm. The largest layer width is set to 64-128, and the code dimension is set to be the minimum cell manifold dimension. The number of epochs is set to 70-100. Our algorithm outputs the subset of genes that can be better reconstructed by AE, which generally corresponds to the subset of genes correlated with a lower dimensional cell manifold or is generated by a simpler class of functions.
novoSpaRc. novoSpaRc (Nitzan et al. 2019;Moriel et al. 2021) is a generalized optimal-transport-based method designed to spatially reconstruct single-cell data based on an interpolation between a structural correspondence assumption (corresponding to spatial smoothness of gene expression) and potentially a reference atlas (such as imaging data for a subset of marker genes). novoSpaRc was shown to successfully reconstruct gene expression patterns over diverse tissues, including the intestinal epithelium and liver lobules. We show that our weight update algorithm improves the reconstruction quality of novoSpaRc by making it more robust when input data contains noisy or less informative genes.

Alternative minimization and multiplicative update
Given a baseline algorithm as in Definition 2, we wish to find a distribution p that minimizes the reconstruction error of A k while not putting too much weight on any single gene. We do this via the multiplicative weight update algorithm with restricted distributions (Kale 2007). Let lðp; X; DðEðXÞÞÞ be a loss function that measures the quality of reconstruction. The problem can be rephrased as the following optimization problem: where jSj is the size of the subset of genes correlated with the current cell manifold. That is, we minimize the loss over distributions that put at most 1 ejSj weight on any particular gene. At each round t, we compute the optimal encoder and decoder for X reweighted by p tÀ1 and update p t by where Dðp; qÞ ¼ P i p i log pi qi is the relative entropy. The first term of Equation (1) increases p t ½i for genes with lower reconstruction loss. The second term prevents p t from concentrating on a small subset of genes. Solving for p t , where m t ¼ rlðp t ; X; D t ðE t ðXÞÞÞ. Finally, we project p t onto P 1 jSj n as follows: Let B be the set of genes with p t ½i > 1 jSj , set p t ½i ¼ 1 jSj and rescale p t [i] for i 2 B so that they sum to 1 À jBj jSj . Genes are partitioned into two sets S 1 ; S 2 based on the underlying manifold they depend on. We denote S 2 as a subset of genes that are captured by the AE. For D 1 , S 2 corresponds to the subset of genes dependent on Unifð½0; 1Þ 2 , and for D 2 , S 2 corresponds to the subset of genes whose expression can be expressed as degree 2 polynomials. where Y ¼ DðEðXÞÞ is the reconstructed data matrix. The reconstruction error for each gene is and we can rewrite the MSE loss as hm t ; p t i.
To check that p t is moving in the right direction, one observation is that the marginal of p t over S is non-decreasing ( Supplementary Information).
For novoSpaRc, we measure reconstruction error by the variance of gene expression over the output cell distribution.
Recall that a gene is strongly correlated with the cell distribution if cells that are mapped to similar true cell locations have similar expressions for this gene. Thus, a natural generalization of MSE loss would be the sum of the variance of the genes. In addition, we can also compute the variance of the derivative of the mean gene expressions to enforce the smoothness assumption of true gene expressions. Let l 1 ; . . . l h : ½N ! ½0; 1 be the cell distribution computed by novoSpaRc where h is the number of cell locations.
lðp; X; l 1 ; . . . ; Var l j ðX½iÞ where Var j is the empirical variance over true cell locations and k in [0, 1].

Evaluation
On Synthetic datasets, we evaluate the performance of the algorithm in the following way: 1) The encoder captures the true cell location for the corresponding cell manifold: we compute the mean and standard mutual information between the encoding of cells and gene expressions for each subset of genes.
2) The algorithm achieves better reconstruction of gene expression after each update of p and D, E: reflected by a decrease in the MSE loss after each update.
3) The algorithm recovers the correct subset of informative genes: reflected by a decrease in the relative entropy between the ground truth distribution and p t , Dðv S ; p t Þ. 4) Round p t to recover the correct subset of genes: AUC score increases. By giving all genes s.t. p T ½i ⁄ p 0 ½i the same label, we show that the true positive rate remains ! 90% while the false-positive rate decreases after 10 iterations.
For the real scRNA-seq datasets, we compute the Pearson's correlation coefficients between known and reconstructed mean gene expressions. We first run the baseline algorithm on pre-weighted (original) data (output of round 0) and show that the Pearson's correlation coefficients improve after $10 rounds of weight update. Table 2 is a summary of the simple synthetic single-cell datasets when genes correlate with a single-cell manifold. We first generated 2 Â 10 4 synthetic cell locations. For each cell manifold, we simulated single-cell RNA-seq data by generating random low-degree polynomials as functions for gene expression. Each gene is normalized to have a mean of 0 and a variance of 1. To obtain a mixture of datasets, we stack two datasets containing genes correlated with a single-cell manifold so that each cell contains genes correlated with two manifolds. Table 3 contains a summary of mixtures of datasets.

scRNA-seq simulations
When adding Gaussian noise to the data, we say it has noise scale b if the input is of the form where X is the noise-less data as described above and N is the normalized noisy data where each entry N½i; j $ N ð0; 1Þ. Each gene expression in X b has a mean of 0 and variance of 1.  Summary of reconstruction results on synthetic and scRNA-seq datasets for the baseline algorithms (Autoencoder and novoSpaRc) without and with the integration of the weight update algorithm as a subroutine. For real scRNA-seq datasets, we record the mean and variance of correlation coefficients between reconstructed and ground truth zonated gene expressions. For Synthetic datasets, we measure the MSE loss on different noise levels. Higher correlation coefficients and lower MSE loss indicate better gene expression reconstruction. In both cases, the integration of the weight update algorithm enhances the quality of reconstruction.

scRNA-seq datasets
We analyze two publicly available single-cell RNA-seq datasets. In both datasets, the reconstructed low-dimensional manifolds correspond to one-dimensional spatial axes within the tissues of origin (crypt-to-villus axis in the intestinal epithelium, portal-to-central vein in liver lobules). Generally, zonated genes are genes whose expression can be expressed as a function of the manifold corresponding to the spatial axes. The first dataset contains cells sampled along the crypt-tovillus axis in the mammalian intestinal epithelium (Moor et al. 2018). In the original study, cell locations are partitioned into seven spatial layers. Here, we denote zonated genes as highly expressed zonated genes with Q value < 0:05 as given by Moor et al. (2018).
where a, b, c are constants that are different for each gene. Notice that there is no need to explicitly model the gene expression for the use of the algorithm we propose for iterative weight updates: for the liver data it is used only for evaluation; for the intestine data it is not used at all. In both cases, gene expression was pre-processed and normalized, as commonly done for scRNA-seq reconstruction algorithms (Moor et al. 2018;Nitzan et al. 2019;Droin et al. 2020), by log-transform, and z-score normalization.

Reconstruction of synthetic single-cell datasets
We first evaluate the performance of our algorithm on various mixtures of synthetic single-cell datasets to showcase the identification of manifold-informative genes and an encoding of the corresponding cell location when signal-to-noise ratio < 1. The synthetic datasets (Table 3) are mixtures of basic single-cell datasets ( Table 2) that differ in cell manifolds dimension, distribution, and degree of gene expression, with noise scale b 2 ½0; 0:75. Employing iterative gene weight updates over single-cell reconstruction using an Autoencoder output improves both relative entropy and accuracy of reconstruction (Fig. 1, Table 1 and Supplementary Fig. S1). Specifically, iterations of gene weight updates lead to the gradual decrease in relative entropy between the output distribution and ground truth, decrease in gene expression reconstruction error, increase in AUC score for gene expression reconstruction, and decrease in false-positive rate (and consistently high, > 93%, true positive rate) for manifoldinformative genes (where in each case of the synthetic mixtures, the correct subset of manifold-informative genes could be obtained by thresholding their weight by 1Àd n for some small d ¼ 0:2) ( Fig. 1 and Supplementary Figs S1-S5). The quality of reconstruction and gene partition gradually decreases with the increasing noise level, as expected ( Fig. 1b and f and Supplementary Figs S1b and f and S2b-S5b).

Reconstruction of scRNA-seq intestinal epithelium data
Here, we show that coupling reconstruction to iterative gene weight update as a subroutine can improve the quality of tissue reconstruction based on scRNA-seq data. We first focus on scRNA-seq data from the mammalian intestinal epithelium (Moor et al. 2018). To capture spatial gene expression along the crypt-to-villus axis, we generated a one-dimensional embedding of the cells, based on the top 1000 highly variable genes (approximately 50% of which are labeled as zonated by Moor et al. 2018), using both the baseline Autoencoder and novoSpaRc. In both cases, iterative weight updates gradually improve the quality of tissue reconstruction (Fig. 2). Specifically, when tracking the correlation between the reconstructed and measured gene expression values, we find that the quality of reconstruction of spatially zonated genes improves with each iteration of the gene weight update (Table  1). Following 10 rounds of gene weight update, the reconstruction quality matches that of training only on zonated, highly variable genes (Fig. 2).
Furthermore, our algorithm gradually shifts the distribution of weights during training, as shown by the decrease in relative entropy between the current weight distribution and uniform distribution over zonated genes with weight update rounds ( Supplementary Fig. S8). While novoSpaRc's quality of reconstruction is higher than that of the baseline Autoencoder, the qualitative effects described in this section are shared by both algorithms (Fig. 2).

Reconstruction of scRNA-seq liver data
We next focus on a more challenging scenario, when genes change along two structures or processes, encoded by multiple cellular manifolds, simultaneously. This is the case for hepatocytes within liver lobules, whose expression changes across both physical space (along the lobule axis) and circadian time, as reflected in a recently collected scRNA-seq dataset (Droin et al. 2020).
We construct a one-dimensional embedding of the cells using both an Autoencoder and novoSpaRc as baseline algorithms. We use the top highly variable genes (200 for AE and 500 for novoSpaRc) out of 15,000 genes for training, approximately 50% of which are zonated (Z, Z þ R, or Z Â R) genes. Applying our algorithm for iterative weight updates gradually enhances the quality of reconstruction of zonated genes (Z, Z þ R, Z Â R) in the de novo setting (without using a reference atlas) (Table 1), and achieves comparable results to those obtained by using a reference atlas as prior knowledge (Fig. 3). This occurs due to the increased weight density over spatially informative genes achieved by the iterative weight update. Furthermore, we are able to recover the correct labels for most zonated genes (Z, Z þ R, Z Â R) using the final cell embedding output by the algorithm, as described below. Similarly to Droin et al. (2020), we recover gene labels from the cell embedding by using a linear regression model to approximate (reconstructed) mean gene expression by lowdegree polynomials of s and t. In (Droin et al. 2020), a mixed regression model was used to approximate measured mean gene expression, however, since we focus here on spatially informative genes whose expression changes monotonically with space, the zonated part of gene expression could be approximated by a monotone function captured by linear regression. Thus, we fit each mean gene expression to a linear function and label genes that can be approximated well (R score > 0:1) and are informative with respect to the spatial manifold (slope > 0:001). Our algorithm achieves 93:7% true positive rate on purely zonated genes and > 80% of true positive rate on mixed zonated genes (Fig. 3g).

Discussion
Our algorithm for iterative gene weight update improves the accuracy of single-cell RNA-seq reconstruction. Our algorithm can reconstruct gene expression, including the case when Figure 2. Iterative gene weight updates enhance spatial reconstruction of the intestinal epithelium. (a, g) Quality of reconstruction measured by correlation coefficients following weight updates when using an Autoencoder (a) or novoSpaRc (g) as a baseline algorithm. z hvg corresponds to the output when trained on the subset of zonated, highly variable genes. (b-l) Mean gene expressions for sample genes from each of the five clusters of zonated genes in (Moor et al. 2018), reconstructed by novoSpaRc (h-l) and AE (b-f) after 0 and 10 rounds of update (novo or AE 0 or 10, respectively), compared to when only trained on zonated highly variable genes (novo or AE Z). i428 Sheng et al.
multiple cell manifolds are simultaneously encoded, without depending on prior information such as manifold-informative marker genes. We showcase these results over two baseline reconstruction algorithms, an Autoencoder and novoSpaRc, and over synthetic single-cell data as well as two scRNA-seq datasets of mammalian intestinal epithelium and liver lobules. Moreover, our algorithm can be incorporated into a general baseline reconstruction algorithm, treating it as a blackbox. Therefore, while in this manuscript we focus on enhancing spatial reconstruction algorithms that are based on scRNA-seq data alone, our algorithm for iterative weight update can also be applied to the reconstruction of underlying structures in the data corresponding to non-spatial processes such as temporal trajectories (see benchmarking of trajectory inference methods in Saelens et al. 2019), as well as algorithms that combine multiple types of experimental data, such as scRNA-seq and spatial transcriptomics data for spatial reconstruction. While our algorithm requires the selection of an appropriate loss function for the chosen baseline algorithm that effectively measures the reconstruction quality of each gene, it can generally be informed by the objective and characteristics of the baseline algorithm.

Supplementary data
Supplementary data is available at Bioinformatics online.

Data availability
The scRNA-seq datasets used for this study were acquired from the Gene Expression Omnibus (GEO) database with the following accession numbers: liver (GSE145197), intestine (GSE109413). Reconstructed mean gene expression for the liver data was downloaded from (https://github.com/naef-lab/ Circadian-zonation). Zonated gene labels for the intestine were downloaded from (https://zenodo.org/record/1320734). (a-f) Example outputs of the recovered mean gene expression based on the iterative weight update algorithm, compared to the true mean gene expression. ZT are ground truth mean gene expressions and reconst ZT are reconstructed mean gene expressions after 10 rounds of weight update using AE as the baseline algorithm. (g) Percentage of zonated genes labeled by the weight update algorithm for Z, Z þ R and Z Â R genes as well as non-zonated genes. Error bars are computed over five runs of training. (i, h) The improvement in reconstruction quality was measured by correlation coefficients after 10 rounds of weight updates for AE (i) and novoSpaRc (h). z hvg corresponds to the output when trained on the subset of zonated highly variable genes.
Robust scRNA-seq reconstruction with gene weight updates i429